AITopics | image encoder

Collaborating Authors

image encoder

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data

Neural Information Processing SystemsJun-23-2026, 03:12:46 GMT

Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RSFMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance RSFM during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image maskreconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM's adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence. Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.68)
Asia > China (0.47)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.73)
Information Technology > Services (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Geographic Information Systems (1.00)
Information Technology > Data Science (1.00)
(4 more...)

Add feedback

REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders

Neural Information Processing SystemsJun-20-2026, 23:39:32 GMT

We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60 faster token generation with 35 less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders--DINO, DINOv2, and OpenCLIP--and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAMbased region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4DVQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. The code and pretrained models are available at https://github.com/savya08/ren.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.67)
Government (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

Natural vs Ultrasound Video Normal Adult Heart

Neural Information Processing SystemsJun-17-2026, 11:21:55 GMT

Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Europe (0.46)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(4 more...)

Add feedback

un2CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Neural Information Processing SystemsJun-15-2026, 18:59:37 GMT

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative models, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Neural Information Processing SystemsJun-14-2026, 07:52:56 GMT

Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been done towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we cover this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.

artificial intelligence, machine learning, natural language, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.98)
Information Technology > Artificial Intelligence > Machine Learning (0.78)
Information Technology > Artificial Intelligence > Vision (0.60)

Add feedback

Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Neural Information Processing SystemsJun-14-2026, 05:27:21 GMT

Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59--without modifying the training objective.

artificial intelligence, machine learning, proceedings, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Neural Information Processing SystemsJun-12-2026, 07:20:35 GMT

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Therapeutic Area (0.42)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.59)

Add feedback

un 2 CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Neural Information Processing SystemsJun-11-2026, 07:42:13 GMT

artificial intelligence, machine learning, natural language, (9 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.58)
Information Technology > Artificial Intelligence > Machine Learning (0.39)

Add feedback

1b57aaddf85ab01a2445a79c9edc1f4b-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 12:56:46 GMT

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Europe (0.67)
North America > United States (0.28)

Genre: Research Report (0.93)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.96)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)

Add feedback

Overview

Neural Information Processing SystemsApr-25-2026, 08:28:40 GMT

In this section, we mainly introduce the axiomatic properties of Shapley value. Weber et al. [17] have proved that Shapley value is the unique metric that satisfies the following axioms: Linearity, Symmetry, Dummy, and Efficiency. If two independent games u and v can be linearly merged into one game w(S) = u(S)+v(S), then the Shapley value of each player i N in the new game w is the sum of Shapley values of the player i in the game uand v, which can be formulated as: ϕw(i|N) = ϕu(i|N)+ϕv(i|N) (1) Symmetry Axiom. Considering two players i and j in a game v, if they satisfy: S N \{i,j},v(S {i}) = v(S {j}) (2) then ϕv(i|N) = ϕv(j|N). The dummy player is defined as the player that has no interaction with other players. Formally, if a player i in a game v satisfies: S N \{i},v(S {i}) = v(S)+v({i}) (3) then this player is defined as the dummy player.

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Game Theory (0.96)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)

Add feedback